Abstract

A popular approach for large scale data annotation tasks is crowdsourcing, wherein
each data point is labeled by multiple noisy annotators. We consider the problem of
inferring ground truth from noisy ordinal labels obtained from multiple annotators of
varying and unknown expertise levels. Annotation models for ordinal data have been
proposed mostly as extensions of their binary/categorical counterparts and have
received little attention in the crowdsourcing literature. We propose a new model for
crowdsourced ordinal data that accounts for instance difficulty as well as annotator
expertise, and derive a variational Bayesian inference algorithm for parameter
estimation. We analyze the ordinal extensions of several state-of-the-art annotator models
for binary/categorical labels and evaluate the performance of all the models on two
real world datasets containing ordinal query-URL relevance scores, collected through
Amazon's Mechanical Turk. Our results indicate that the proposed model performs
as well as or better than existing state-of-the-art methods and is more resistant to 'spammy'
annotators (i.e., annotators who assign labels randomly without actually looking at
the instance) than popular baselines such as mean, median, and majority vote, which
do not account for annotator expertise.

*Part of the work was done while at Yandex Labs. 
1 Introduction



Supervised learning tasks such as classification, regression and ranking require features as
well as ground truth labels for the training and evaluation datasets. Unfortunately, obtaining
ground truth labels for large datasets is an expensive endeavor. Crowdsourcing [How06] is
an attractive solution to this problem. In this approach, one typically obtains multiple
labels for each training instance from annotators of unknown and varying expertise levels.
Crowdsourcing marketplaces such as Amazon's Mechanical Turk (AMT)¹ enable us to collect
labels for large datasets in a time-effective as well as cost-effective manner. The past few
years have witnessed a significant increase in the use of crowdsourcing for large scale data
annotation tasks in domains such as natural language processing [SOJN08] and computer
vision [SF08].

Naturally, the next question is: how do we handle multiple labels for each training 
and evaluation instance during the supervised learning process? One simple approach is 
to estimate the ground truth (for instance, a weighted combination of the multiple labels) 
and use this as input to the supervised learning algorithm. Although frequently studied as 
part of a supervised learning problem, the task of estimating the ground truth from multiple 
annotations is an interesting problem on its own. For example, Dawid et al. [DS79] discuss
the task of estimating the true response of a patient from patient records, while Smyth et al.
[SFB+95] discuss the task of detecting small volcanoes in Magellan SAR images of Venus.

The critical question is then: how do we optimally combine labels from multiple
annotators to form the estimate of the ground truth? Some simple heuristics for combining the
labels are majority vote (mode), mean and median. However, these heuristics do not model the
fact that annotators can have varying expertise levels or that training instances themselves can
have varying difficulties, nor do they capture other characteristics of crowdsourced data.
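
The three heuristics above can be sketched in a few lines of Python. This is only an illustration: the function name and the tie-breaking rule for the majority vote (smallest label among the modes) are assumptions made here, not choices stated in the paper.

```python
from collections import Counter
from statistics import mean, median


def aggregate_baselines(labels):
    """Return (mean, median, majority vote) for one instance's ordinal labels."""
    counts = Counter(labels)
    top = max(counts.values())
    # Tie-breaking is an assumption: take the smallest label among the modes.
    majority = min(label for label, count in counts.items() if count == top)
    return mean(labels), median(labels), majority


# Four annotators rate the same instance on a 5-point scale.
print(aggregate_baselines([5, 4, 4, 1]))
```

All three treat every annotator identically, which is exactly the limitation the probabilistic model is meant to address.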

In this paper, we develop a probabilistic model of the data annotation process, and use 
Bayesian inference to estimate the ground truth labels. The probabilistic modeling approach 
is very flexible, and a variety of complexities in the annotation process can be incorporated. 
The probabilistic modeling approach can be used to jointly estimate ground truth labels 
and optimize the parameters of the supervised learning algorithm; cf. [YRF+10, RYZ+10];
however, we do not pursue this approach in this paper and leave it for future work. Existing
approaches [BGMG12, Car08, DS79, KOS11, RYZ+09, RYZ+10, RY11, RGP10, SOJN08,
WBBP10, WRW+09, YRF+10] can be broadly categorized according to the following criteria:

• Are the observed and ground truth labels binary/real/ordinal/categorical? 

• Are ground truth labels (for a subset of training instances) required for training? 

• Are annotator expertise levels modeled? 

• Are instance difficulties and/or instance features modeled? 



¹ https://www.mturk.com
One could use combinations of the ideas discussed above as well; for instance, there has been
some work on joint modeling of annotator expertise and instance features. The above list 
is not comprehensive; for example, there are differences in the type of parameter estimation 
technique employed (optimization, expectation maximization, Bayesian inference, etc.) as 
well. 

In this work, we propose a probabilistic model for crowdsourced ordinal annotations.
Ordinal labels arise naturally in many real world datasets, for example, movie/restaurant
ratings and query-URL relevance in information retrieval. Annotation models for ordinal
data have been proposed mostly as extensions of their binary/categorical counterparts
[WRW+09, DS79, RYZ+09], which lose the natural ordering of label values and, to the best
of our knowledge, have neither been studied in detail nor evaluated experimentally. The
proposed model handles ordinal labels in a natural manner, preserving the ordered nature
of the label values, in contrast to most prior work in this area, which has focused
on binary, categorical or real-valued labels. In [RY11], the work most similar to ours, even
though the observed ratings are assumed to be ordinal, the ground truth labels are assumed
to be binary. Real world annotation tasks often involve instances of varying difficulty levels
and annotators of varying expertise levels. Crowdsourcing marketplaces typically attract
'spammy' annotators, defined as low quality annotators who randomly guess the label
without actually looking at the instance, and hence it is necessary to identify and weed out
spammy annotators. Our model can account for varying levels of instance difficulty, which
is useful for active learning (e.g., we could obtain more labels for the difficult instances).
Our model can also account for varying levels of annotator expertise and, in addition, explicitly
models spammy annotators. Hence, our model is able to down-weight spammy annotators
and effectively combine the labels from different annotators according to their expertise
levels. We assume that instance features are not available. While some authors have suggested
modeling instance difficulty using instance features, it is non-trivial in general to derive
features that reflect instance difficulty. For instance, it is not obvious what characteristics of
an image determine the difficulty perceived by the annotators. We assume that the ground
truth labels are not available for training, which is typically a realistic assumption. Our
model is very simple, with an efficient variational Bayesian inference algorithm.

We show that our model subsumes a number of existing models and outperforms popular
baselines such as mean, median and majority vote. In addition, we explore the ordinal
extensions of several state-of-the-art annotator models for binary/categorical labels, and
systematically evaluate the performance of the different ordinal annotation models on two real
world information retrieval datasets. We empirically demonstrate that our model performs
as well as or better than existing approaches and is more resistant to spammy annotators.

In Section 2, we describe our model and inference algorithm. In Section 3, we discuss
relevant prior work and elaborate on the relationship between our model and some of the
prior work. We report experimental results in Section 4 and conclude in Section 5.
2 Proposed Ordinal-discrete-mixture model



In this section, we first describe our problem setup and the nature of the dataset. Next, 
we introduce our proposed model and present variational inference updates for parameter 
estimation. 



2.1 Problem setup 

We assume that there are $N$ annotators and $M$ instances (e.g., images, query-URL pairs).
Let $r_{nm}$ denote the label provided by the $n$th annotator to the $m$th instance, and $z_m$ denote
the (unobserved) ground truth label for the $m$th instance. We assume $z_m$ to be real valued
and $r_{nm}$ to take on values on an ordinal scale with $K$ different values, $1, 2, \ldots, K$. The
proposed model can be easily modified to handle ratings on a $v = \{v_1, v_2, \ldots, v_K\}$ scale,
where $v_1 < v_2 < \cdots < v_K$ and $\{v_k\}$ are known. Each instance is typically labeled by very few
($\ll N$) annotators; hence the observations can be visualized as an $M \times N$ sparse matrix $R$. Let
$L$ denote the set of indices where the rating is observed, i.e., $L = \{(n, m) : r_{nm} \text{ is observed}\}$.

To provide a concrete example, the Yandex dataset used in our experiments contains
$M \approx 10$K instances of query-URL pairs. For each pair, a small subset of the $N = 51$
annotators is asked to rate how relevant the URL is for the query on a 5-point scale, where 5
represents the highest possible relevance and 1 represents the lowest possible relevance. A total
of $|L| \approx 40$K annotations were collected, from which our problem is to estimate the ground
truth relevance rating (which we assume exists) for each of the 10K query-URL pairs.

We assume that the training instances which have the same difficulty can be grouped
into categories. For instance, in the web search example, if we assume all query-URL pairs
belonging to the same query are equally difficult, the category could refer to the query
corresponding to each query-URL pair. Let there be $C$ ($\le M$) categories and let
$c(m) \in \{1, 2, \ldots, C\}$ denote the category of the $m$th instance. In the case where the category is not
observed, one might interpret category as a modeling choice that allows us to control the
granularity level at which we model instance difficulty (cf. Section 4.4); for instance, one
could set $c(m) = m$ (every instance is treated separately) or $c(m) = 1$ (every instance is
treated equally).

Let $L_m$ denote the set of annotators corresponding to the $m$th instance, $L_n$ denote the
set of instances corresponding to the $n$th annotator, and let $L_{c'} = \{(n, m) \in L : c(m) = c'\}$
denote the set of ratings corresponding to category $c'$.



2.2 Ordinal-discrete-mixture model 

In this section we describe our proposed graphical model, which is shown in its entirety in
Figure 1. The objective of inference is to estimate the ground truth value $z_m$ for the $m$th
instance. We assume a Gaussian prior for $z_m$ with density $\mathcal{N}(\cdot \mid \mu, \lambda^{-1})$, where the mean is $\mu$
and the precision (inverse variance) is $\lambda$. We model the observed rating $r_{nm}$ as a draw from
a mixture model with two components, one dependent on the ground truth $z_m$ and a second
'spam' component (i.e., independent of $z_m$) that is shared across annotators and instances.
We denote the choice between the two components using a binary random variable $y_{nm}$. The
prior mean of $y_{nm}$, denoted by $\epsilon_n$, is annotator specific, to allow for varying levels of noise in
the annotations of different annotators. When $y_{nm} = 0$, the rating is drawn from the spam
component distribution, which is simply a discrete distribution with probabilities $\pi$. Hence,
$1 - \epsilon_n$ can be interpreted as the spamminess measure of the $n$th annotator.

Figure 1: Graphical model for Ordinal-discrete-mixture model: $\alpha, \beta, \mu, \lambda, \phi, \eta, \pi, \{\epsilon_n\}_{n=1}^{N}$ are
parameters, $\{\delta_c\}_{c=1}^{C}$, $\{\tau_n\}_{n=1}^{N}$, $\{z_m\}_{m=1}^{M}$, $\{x_{nm}, y_{nm}\}_{nm \in L}$ are latent variables and $\{r_{nm}\}_{nm \in L}$
are the observed ratings.

When $y_{nm} = 1$, the ordinal-valued $r_{nm}$ is modeled as follows: we first draw $x_{nm}$ from
a Gaussian distribution centered at the true value $z_m$, then map the continuous-valued
$x_{nm}$ to an ordinal $r_{nm}$ deterministically by simply thresholding. Let $b_0, b_1, \ldots, b_K$ denote a
series of thresholds. Then $r_{nm} = k$ for the smallest $k$ such that $b_k$ is larger than $x_{nm}$, i.e.,
$b_{k-1} \le x_{nm} < b_k$. In our experiments, $K = 5$, and we simply fix $b_0 = 0.5$, $b_1 = 1.5$, $b_2 = 2.5$,
$b_3 = 3.5$, $b_4 = 4.5$, $b_5 = 5.5$. One could learn user specific thresholds as well, but this
causes non-identifiability in the $z_m$ values, which then may not be on the same scale as
$r_{nm}$. See [RGP10] for a related discussion in the context of ordinal regression. We model the
dependence of $x_{nm}$ on the ground truth $z_m$ using a Gaussian distribution with annotator and
category specific noise precision. For simplicity, we take this precision to be $\tau_n \delta_{c(m)}$, where
the positive latent variable $\tau_n$ can be interpreted as the expertise of the $n$th annotator, i.e.,
higher $\tau_n$ implies smaller variance around the true value $z_m$. The positive latent variable $\delta_c$ can
be interpreted as the inverse difficulty of the $c$th category. If $\delta_c < 1$, the category is difficult
and even annotators whose ratings usually have high precision can exhibit lower precision.
Note that $\delta_c$ is shared by all instances corresponding to the $c$th category. In summary, we
have $x_{nm} \sim \mathcal{N}(\cdot \mid z_m, (\tau_n \delta_{c(m)})^{-1})$. Finally, we impose independent gamma priors on $\tau_n$ and
$\delta_{c(m)}$.
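The complete generative process is shown in Figure 1. The conditional densities are given
below:

$$
\begin{aligned}
p(z_m \mid \mu, \lambda) &= \mathcal{N}(z_m \mid \mu, \lambda^{-1}), \\
p(\tau_n \mid \alpha, \beta) &= \mathcal{G}(\tau_n \mid \alpha, \beta), \\
p(\delta_c \mid \phi, \eta) &= \mathcal{G}(\delta_c \mid \phi, \eta), \\
p(y_{nm} \mid \epsilon_n) &= \mathrm{Be}(y_{nm} \mid \epsilon_n), \\
p(x_{nm} \mid z_m, \tau_n, \delta_{c(m)}) &= \mathcal{N}(x_{nm} \mid z_m, (\tau_n \delta_{c(m)})^{-1}), \\
p(r_{nm} \mid x_{nm}, y_{nm}, \pi) &= \pi_{r_{nm}}^{\,1 - y_{nm}} \; \mathbb{I}\left[ b_{r_{nm}-1} \le x_{nm} < b_{r_{nm}} \right]^{\, y_{nm}},
\end{aligned} \tag{1}
$$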
where $\mathbb{I}[\cdot]$ denotes the indicator function, $\mathcal{G}(\cdot \mid \alpha, \beta)$ denotes the gamma density with shape
parameter $\alpha$ and inverse scale parameter $\beta$, and $\mathrm{Be}(\cdot \mid \epsilon)$ denotes a Bernoulli probability with
mean $\epsilon$. In the following, the hyperparameters are $\theta = \{\alpha, \beta, \mu, \lambda, \phi, \eta, \epsilon, \pi\}$. In our
experiments, we set all the entries of $\pi$ to $1/K$, i.e., we assume that the ratings from a spammy
annotator are uniformly distributed. We learn $\alpha$, $\beta$ and $\epsilon$ using type
II maximum likelihood within the variational Bayesian inference algorithm, while we fix
$\phi = 10$, $\eta = 5$ to ensure identifiability², and fix $\mu$ to the mean of $v$ (the ordinal scale) and
$\lambda = 0.1$.
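
As a rough illustration of the generative process described in this section, the following Python sketch draws a single rating. The function and argument names are hypothetical, and the handling of draws that fall outside $[b_0, b_K)$ (they are clipped to the end categories) is an assumption made for this example rather than a detail taken from the model specification. The spam distribution is uniform, matching $\pi_k = 1/K$.

```python
import bisect
import random


def sample_rating(z, tau, delta, eps, thresholds, rng):
    """Draw (r, y) for one annotation.

    z: ground truth value, tau: annotator expertise, delta: inverse difficulty,
    eps: prior probability that the annotator is not spamming,
    thresholds: [b_0, ..., b_K].
    """
    K = len(thresholds) - 1
    y = 1 if rng.random() < eps else 0           # mixture indicator y_nm
    if y == 0:
        return rng.randint(1, K), y              # spam component: uniform over 1..K
    x = rng.gauss(z, (tau * delta) ** -0.5)      # precision tau * delta
    # Count interior thresholds b_1..b_{K-1} that are <= x.
    # Clipping draws below b_0 or above b_K is an assumption of this sketch.
    r = bisect.bisect_right(thresholds[1:K], x) + 1
    return r, y


thresholds = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
rng = random.Random(0)
print([sample_rating(3.2, 4.0, 1.0, 0.8, thresholds, rng) for _ in range(5)])
```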

2.3 Parameter Estimation 

The marginal likelihood of the observed annotations is given by 

$$p(R \mid \theta) = \mathbb{E}_{\tau, z, \delta, X, Y}\left[ p(R \mid \tau, z, \delta, X, Y) \right].$$

Since both this and the posterior distribution are intractable, we use a variational Bayes (VB)
algorithm [Bea03] for parameter estimation. Alternatively, one may choose to use Markov
chain Monte Carlo, but this can be slower to run and to collect enough samples to estimate $z_m$
well. In VB, the log marginal likelihood is lower bounded by the negative variational free
energy:

$$\ln p(R \mid \theta) = \mathcal{F}(q, \theta) + \mathrm{KL}\big( q(z, \tau, \delta, X, Y) \,\|\, p(z, \tau, \delta, X, Y \mid R, \theta) \big),$$

where $p(z, \tau, \delta, X, Y \mid R, \theta)$ denotes the true posterior, $q(z, \tau, \delta, X, Y)$ the approximate
variational posterior, and $\mathcal{F}(q, \theta)$ denotes the variational lower bound (i.e., the negative
variational free energy):

$$\mathcal{F}(q, \theta) = \mathbb{E}_q\left[ \ln p(R, z, \tau, \delta, X, Y \mid \theta) \right] + \mathbb{H}[q].$$

² Note that there is a multiplicative degree of freedom since $\tau_n$ and $\delta_{c(m)}$ only appear as the product
$\tau_n \delta_{c(m)}$ in the precision term.

Note that $\mathbb{E}_q[\cdot]$ denotes expectation with respect to the (variational) distribution $q$ and
$\mathbb{H}[q]$ denotes the entropy of $q$. We assume that the variational distribution $q(\tau, z, \delta, X, Y)$
factorizes as follows:

$$q(\tau, z, \delta, X, Y) = \prod_{n} q(\tau_n) \prod_{m} q(z_m) \prod_{c} q(\delta_c) \prod_{nm \in L} q(x_{nm}, y_{nm}). \tag{2}$$

Note that we model the mixture indicator $y_{nm}$ and the continuous-valued latent variable $x_{nm}$
jointly in the variational posterior. This is because of the deterministic relationship between
$x_{nm}$ and $r_{nm}$ when $y_{nm} = 1$, which induces strong dependence between $x_{nm}$ and $y_{nm}$ in the
true posterior. Since $x_{nm}$ and $r_{nm}$ are independent when $y_{nm} = 0$, it is sufficient to keep
track of $q(x_{nm}, y_{nm})$ via $q(y_{nm})$ and $q(x_{nm} \mid y_{nm} = 1)$ only. Furthermore, since $x_{nm}$ cannot lie
outside the interval $[b_{r_{nm}-1}, b_{r_{nm}})$ when $y_{nm} = 1$, we will see that the optimal
$q(x_{nm} \mid y_{nm} = 1)$ is simply a truncated Gaussian: its conditional prior distribution given $z_m$
and $\tau_n \delta_{c(m)}$, limited to the interval $[b_{r_{nm}-1}, b_{r_{nm}})$.

Alternatively, the latent variable $x_{nm}$ can be integrated out in (1), and $r_{nm} \mid z_m$ can be
expressed as a difference of two Gaussian CDFs as in ordinal regression [CG05]. However,
treating $\{x_{nm}\}$ as latent variables leads to much simpler variational updates and we have
found that it works sufficiently well in practice. Such an approach has been successfully
applied for multinomial probit regression in [GR06].
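
To make the "difference of two Gaussian CDFs" concrete, here is a small Python sketch of the ordinal likelihood obtained when $x_{nm}$ is integrated out for the non-spam component ($y_{nm} = 1$). The function names are hypothetical, and the precision argument stands for the product $\tau_n \delta_{c(m)}$.

```python
import math


def normal_cdf(u):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def ordinal_likelihood(k, z, precision, thresholds):
    """p(r = k | z, y = 1) = Phi((b_k - z) s) - Phi((b_{k-1} - z) s), s = sqrt(precision)."""
    s = math.sqrt(precision)
    upper = normal_cdf((thresholds[k] - z) * s)
    lower = normal_cdf((thresholds[k - 1] - z) * s)
    return upper - lower


thresholds = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]   # b_0, ..., b_5 as fixed in the text
probs = [ordinal_likelihood(k, 3.0, 4.0, thresholds) for k in range(1, 6)]
print(probs)
```

With finite $b_0$ and $b_K$ the probabilities sum to slightly less than one, because a small amount of Gaussian mass falls outside $[b_0, b_K)$.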



2.4 Variational updates 

For completeness, we provide the full set of variational updates in this section. The
variational approximation leads naturally to a Gaussian posterior $q(z_m)$, Bernoulli posterior
$q(y_{nm})$, and gamma posteriors $q(\tau_n)$, $q(\delta_c)$. As mentioned earlier, $q(x_{nm} \mid y_{nm} = 1)$ has the
form of a truncated Gaussian, whose density we denote as $\mathcal{TN}(\cdot \mid \mu, \sigma^2, l, u)$, where the
Gaussian mean is $\mu$, variance is $\sigma^2$, and the lower and upper limits are given by $l$, $u$ respectively.
We parametrize the variational posteriors using variational parameters as follows:

$$
\begin{aligned}
q(z_m) &= \mathcal{N}(z_m \mid \mu_m, \lambda_m^{-1}), \\
q(\tau_n) &= \mathcal{G}(\tau_n \mid \alpha_n, \beta_n), \\
q(\delta_c) &= \mathcal{G}(\delta_c \mid \phi_c, \eta_c), \\
q(y_{nm}) &= \mathrm{Be}(y_{nm} \mid u_{nm}), \\
q(x_{nm} \mid y_{nm} = 1) &= \mathcal{TN}(x_{nm} \mid \nu_{nm}, \rho_{nm}, b_{r_{nm}-1}, b_{r_{nm}}).
\end{aligned} \tag{3}
$$
The variational E-step updates are:
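Note that an overbar denotes the expectation of the corresponding variable with respect to
its variational distribution, i.e., $\bar{z}_m = \mu_m$, $\overline{z_m^2} = \mu_m^2 + 1/\lambda_m$, $\bar{y}_{nm} = u_{nm}$, $\bar{\tau}_n = \alpha_n/\beta_n$,
$\bar{\delta}_{c(m)} = \phi_{c(m)}/\eta_{c(m)}$, and $\bar{x}_{nm}$, $\overline{x_{nm}^2}$ denote the first two moments of the truncated Gaussian
distribution given $y_{nm} = 1$. Maximizing $\mathcal{F}(q, \theta)$ with respect to $\epsilon_n$, we obtain

$$\epsilon_n = \frac{1}{|L_n|} \sum_{m \in L_n} \bar{y}_{nm}, \tag{5}$$

where $L_n$ is the set of instances rated by the $n$th annotator.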



The update for $\alpha$ and $\beta$ involves just maximum likelihood estimation for the gamma
distribution.
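
The updates rely on the first two moments of a truncated Gaussian. The sketch below computes them with the standard closed-form expressions; the function names are hypothetical, and the code makes no attempt to stay numerically stable when the interval lies far in the tail of the Gaussian.

```python
import math


def _pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def _cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def truncated_gaussian_moments(mean, var, lower, upper):
    """Return (E[x], E[x^2]) for N(mean, var) truncated to [lower, upper)."""
    s = math.sqrt(var)
    a, b = (lower - mean) / s, (upper - mean) / s
    mass = _cdf(b) - _cdf(a)                     # Gaussian mass inside the interval
    shift = (_pdf(a) - _pdf(b)) / mass
    first = mean + s * shift
    variance = var * (1.0 + (a * _pdf(a) - b * _pdf(b)) / mass - shift ** 2)
    return first, variance + first ** 2


# Rating r = 3 restricts x to [2.5, 3.5) under the thresholds used in the text.
print(truncated_gaussian_moments(3.4, 0.25, 2.5, 3.5))
```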



2.5 Prediction of ground truth 

The optimal prediction (in the mean-squared sense) is the posterior mean, i.e., $\hat{z}_m = \mathbb{E}[z_m \mid R, \theta]$.
We can approximate this using the mean under the variational posterior defined in (3), i.e.,
$\hat{z}_m \approx \mu_m$.



3 Related work 

In this section, we provide an overview of existing approaches for dealing with ordinal data 
and highlight the connections between the proposed model and previous models. Rogers 
et al. [RGP10] proposed a multi-annotator ordinal regression model involving a Gaussian
process prior over the function mapping instance features to ground truth label. In this
paper, we assume that instance features are not observed and $z_m$ are independent, though
in principle, one could assume a GP prior over $z$ in the flavor of [RGP10]. When $\epsilon_n = 1, \forall n$
and $\delta_c = 1, \forall c$, we obtain the model in [RGP10].³ Another key difference is that we use
variational inference, unlike [RGP10], who propose a Gibbs sampling algorithm; this leads
to significant computational gains and hence our algorithm is scalable to large datasets.

We also consider models that simplify the ordinal labels to one of the following label 
types: 

Continuous labels: In this case, the observed labels as well as the ground truth labels
are assumed to be real-valued. If we remove the ordinal mapping in the Ordinal-discrete-mixture
model, i.e., set $r_{nm} = x_{nm}$, the model can produce continuous ratings. Raykar et
al. [RYZ+10] suggested the following model for real-valued ratings: $r_{nm} \sim \mathcal{N}(r_{nm} \mid z_m, \tau_n^{-1})$.
Note that we obtain this model when $\delta_c = 1, \forall c$ (i.e., instance difficulty is not modeled),
$r_{nm} = x_{nm}, \forall n, m$, and $\epsilon_n = 1, \forall n$. Another key difference between the Ordinal-discrete-mixture
model and [RYZ+10] is that we impose gamma priors on $\tau$ and use a variational inference
algorithm rather than a maximum likelihood solution (which can lead to arbitrarily large
values of $\tau$). We refer to the model of [RYZ+10] as the Continuous-ML model.

Multi-class labels: In this case, the observed labels as well as the ground truth labels are 
treated as (discrete) class labels and the relative order of the labels is ignored. We consider 
the following two models for multi-class labels: 

Dawid-Skene: this model was proposed by [DS79]. It uses $O(K^2)$ parameters
per annotator and does not account for instance difficulty. Further details are available in
Appendix A.1.

GLAD: this is the multi-class extension proposed in [WRW+09]. This model uses $O(K)$
parameters per annotator and accounts for instance difficulty. Further details are available
in Appendix A.2.



Binary labels: Raykar et al. [RYZ+09] suggested an extension of their binary noise model
to ordinal labels by reducing the ordinal labels to $K - 1$ binary variables, defining
$z_{mk} = \mathbb{I}[z_m > k]$, $r_{nmk} = \mathbb{I}[r_{nm} > k]$, $1 \le k < K$ [FH01]. In this case, the observed labels as well as
the ground truth labels are treated as ordinal labels. Note that the ground truth labels are
assumed to be real-valued in our model. While other annotator models for binary labels have
been explored in [RYZ+09, WRW+09, Car08, YRF+10, WBBP10, KOS11], for simplicity,
we restrict our attention to the binary noise model proposed in [RYZ+09]. We refer to this
model as the Ord-Binary model. Further details are available in Appendix A.3.
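
The reduction to $K - 1$ binary variables can be sketched as follows. The function names are hypothetical, and the decoding rule (one plus the number of positive indicators) is one simple choice assumed here; it is exact for consistent encodings but is not necessarily how inconsistent binary predictions are combined in [RYZ+09].

```python
def ordinal_to_binary(r, K):
    """Encode r in {1, ..., K} as the K-1 indicators [r > 1, ..., r > K-1]."""
    return [1 if r > k else 0 for k in range(1, K)]


def binary_to_ordinal(bits):
    """Decode a consistent encoding back to the ordinal label."""
    return 1 + sum(bits)


print(ordinal_to_binary(3, 5))                     # [1, 1, 0, 0]
print(binary_to_ordinal(ordinal_to_binary(3, 5)))  # 3
```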



Our proposed Ordinal-discrete-mixture model as well as the Continuous-ML model are
applicable only in scenarios where the relative differences in the ordinal scale $\{v_1, \ldots, v_K\}$
can be quantified. However, the Dawid-Skene, GLAD and Ord-Binary models are applicable
even when the relative differences are not quantified (for instance, $\{v_1, v_2, v_3\}$ = {cold, warm,
hot}).

³ Note that $\epsilon_n = 1, \forall n$ implies that $y_{nm} = 1, \forall n, m$, corresponding to the case where the data is always
generated according to the ordinal mixture component.

Note that none of the models discussed above contain a mixture component for handling 
spammy ratings. To the best of our knowledge, the use of a mixture component for handling 
spammy ratings is novel. A two-component mixture model was proposed in [BGMG12] for 
modeling students' responses. If the student knows the correct answer, the observed response 
is the same as the ground truth (i.e., there is no noise model), else the observed response is 
generated from a noise distribution. However, neither the observed ratings nor the ground
truth ratings are ordinal in [BGMG12].



4 Experimental results 

4.1 Dataset 

We evaluate the models on two datasets, namely the Yandex and TREC datasets. The 
Yandex dataset consists of 40,340 ratings corresponding to 51 annotators, 601 queries and 
10,462 query-URL instances collected through Amazon's Mechanical Turk. Hence, there 
are about 17.4 URLs per query and 3.85 ratings per query-URL instance on average. The 
annotators were shown a query-URL pair and asked to rate the relevance of the URL for 
that particular query on a 5-point scale, with 5 representing highest possible relevance and 
1 representing lowest possible relevance. The ground truth ratings are available for all the 
10,462 query-URL pairs and were collected from in-house expert annotators. The original
TREC dataset used in [BMLS10] consists of 98,453 ratings corresponding to 766 annotators,
100 queries and 20,232 query-URL instances. The ground truth ratings are available for
only 3,277 of the 20,232 query-URL instances. We slightly processed this
dataset⁴ to obtain a dataset containing 91,783 ratings on a 3-point scale, corresponding to
762 annotators, 100 queries and 20,026 query-URL instances.



4.2 Evaluation 

We use the following performance metrics to compare the methods:

• Mean squared error (MSE): $\frac{1}{M} \sum_{m} (z_m - \hat{z}_m)^2$.

• Pearson's correlation coefficient (Correlation): $\frac{\mathrm{Cov}(z, \hat{z})}{\sqrt{\mathrm{Var}(z)\,\mathrm{Var}(\hat{z})}}$, where $\mathrm{Var}(Z)$
denotes the variance of $Z$ and $\mathrm{Cov}$ denotes covariance.

• Normalized Discounted Cumulative Gain (NDCG): When the estimated ground truth
values are used to train or evaluate a supervised ranking algorithm, we care about the
relative differences between the ground truth estimates of URLs corresponding to the
same query rather than the absolute values of the ground truth estimates. NDCG is
a ranking measure that evaluates how well a list of URLs is ranked compared to the
ideal ordering (i.e., URLs sorted in descending order of ground truth relevance values)
[Liu09]. Note that NDCG is a query level metric, unlike MSE and Correlation, which
are query-URL level metrics.

Lower values of MSE, and higher values of correlation and NDCG, indicate better
performance.

⁴ The ratings were originally on a {−2, −1, 0, 1, 2} scale, where −1 corresponds to a missing ground truth
label and −2 corresponds to a broken link. We excluded the annotator ratings with value −2, and mapped
the values from {0, 1, 2} to {1, 2, 3}. This mapping affects the value of NDCG, but does not affect the values
of MSE and correlation.
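
The three metrics can be sketched in Python as below. The function names are hypothetical. The NDCG variant shown (gain $2^{\text{rel}} - 1$ with a $\log_2$ position discount, computed over the full list of one query) is a common convention and is an assumption here, since the exact formula used in the experiments is not spelled out in the text.

```python
import math


def mse(z, z_hat):
    return sum((a - b) ** 2 for a, b in zip(z, z_hat)) / len(z)


def pearson(z, z_hat):
    mz, mh = sum(z) / len(z), sum(z_hat) / len(z_hat)
    cov = sum((a - mz) * (b - mh) for a, b in zip(z, z_hat))
    vz = sum((a - mz) ** 2 for a in z)
    vh = sum((b - mh) ** 2 for b in z_hat)
    return cov / math.sqrt(vz * vh)


def ndcg(true_rel, est_scores):
    """NDCG for one query: rank URLs by est_scores, score with true_rel.

    Gain 2**rel - 1 and log2 discount are assumed conventions.
    """
    def dcg(rels):
        return sum((2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(rels))

    order = sorted(range(len(true_rel)), key=lambda i: -est_scores[i])
    ideal = dcg(sorted(true_rel, reverse=True))
    return dcg([true_rel[i] for i in order]) / ideal if ideal > 0 else 0.0


print(mse([1, 2, 3], [1, 2, 5]), pearson([1, 2, 3], [2, 4, 6]), ndcg([3, 2, 1], [0.9, 0.5, 0.1]))
```

Because NDCG is a query level metric, a dataset-level figure would be obtained by averaging `ndcg` over queries.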

4.3 Simulation details 

The inference algorithms might converge to a local optimum. Hence, for each model, we
initialize the inference algorithm 10 times with different initializations and compute the
performance metrics using the predictions corresponding to the parameter settings with the
highest (variational) lower bound. We restrict the maximum number of iterations to 1000
and stop if the absolute difference between successive lower bounds, $\Delta\mathcal{F}(q, \theta)$, is less than 0.1. We
implemented all our scripts in MATLAB. The scripts can be downloaded from the authors'
webpages.
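
The restart-and-stop protocol described above can be sketched generically in Python. The original scripts are in MATLAB and are not reproduced here; `init_fn` and `step_fn` are hypothetical callables standing in for a model's initialization and for one inference iteration that returns the updated state together with its lower bound.

```python
import random


def fit_with_restarts(init_fn, step_fn, n_restarts=10, max_iters=1000, tol=0.1, seed=0):
    """Run inference from several initializations and keep the best lower bound.

    init_fn(rng) -> state; step_fn(state) -> (new_state, lower_bound).
    """
    rng = random.Random(seed)
    best_state, best_bound = None, float("-inf")
    for _ in range(n_restarts):
        state = init_fn(rng)
        prev = float("-inf")
        for _ in range(max_iters):
            state, bound = step_fn(state)
            if abs(bound - prev) < tol:      # change in lower bound below tolerance
                break
            prev = bound
        if bound > best_bound:               # keep the run with the highest bound
            best_state, best_bound = state, bound
    return best_state, best_bound


# Toy stand-in for inference: the state shrinks toward 0 and the "bound" is -state**2.
state, bound = fit_with_restarts(lambda rng: rng.uniform(-5, 5),
                                 lambda s: (0.5 * s, -(0.5 * s) ** 2))
print(state, bound)
```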

4.4 Comparison of the methods 

We consider the Ordinal-discrete-mixture model with three different configurations:
Ordinal-discrete-mixture (query-URL), where each instance is treated as a separate category ($c(m) = m$),
Ordinal-discrete-mixture (query), where all URLs for a given query belong to the same
category, and, with a slight abuse of notation, Ordinal-discrete-mixture, where instance
difficulty is not modeled ($c(m) = 1$). We compare the Ordinal-discrete-mixture model to the other
models described in Section 3 as well as simple baselines such as mean, median and majority
vote⁵. The results are shown in Table 1, with the proposed model variants highlighted in
bold. We observe that in terms of all the metrics, the Ordinal-discrete-mixture model
performs better than majority vote, mean and median. Perhaps surprisingly, modeling instance
difficulty does not seem to improve performance on the Yandex dataset, and all three variants
of the Ordinal-discrete-mixture model perform similarly. However, modeling instance difficulty
improves performance on the TREC dataset, and the Ordinal-discrete-mixture (query-URL)
variant performs the best. Amongst the models discussed in Section 3, the Dawid-Skene
model achieves the best overall performance on both datasets (although the Continuous-ML
model, not surprisingly, achieves the lowest MSE), suggesting that reducing ordinal
labels to multi-class labels is better than reducing them to continuous or binary labels.
Amongst the multi-class label methods (see Section 3), the Dawid-Skene model outperforms
the GLAD model, indicating that modeling the full confusion matrix is beneficial.

⁵ Note that we assumed the ordinal scale $\{v_1, v_2, \ldots, v_K\}$ is known; hence, it is possible to compute the
mean and median in a meaningful way.
Table 1: Comparison between different methods on Yandex and TREC datasets: For models
that account for instance difficulty, the granularity is shown in parentheses. The proposed
Ordinal-discrete-mixture model (highlighted in bold) outperforms or performs as well as
existing state-of-the-art methods.



4.5 Effect of spammy ratings 

Real world crowdsourced data is often noisy and it is desirable to identify 'spammy'
annotators and down-weight their ratings. In this experiment, we analyze the effect of spammy
ratings on the different models. For simplicity, we just present results on the Yandex dataset
in this experiment. For every query-URL pair in the Yandex dataset, we introduce additional
'fake' spammy ratings drawn from a uniform distribution. We introduced fake annotators
and assigned these fake ratings to fake annotators such that the average number of ratings
for a fake annotator is the same as the average number of ratings for a real annotator. We
vary the number of fake ratings per query-URL pair from 0 to 9 and repeat the previous
experiment. The results are shown in Figure 2. We observe that models which model annotator
expertise are more resistant to spam, and simple baselines such as mean, median, and
majority vote perform significantly worse in this experiment. In particular, we observe that the
proposed Ordinal-discrete-mixture model is robust to spammy annotators and outperforms
existing state-of-the-art methods discussed in Section 3.
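
The spam-injection procedure can be sketched as follows. This is an illustrative reading of the description above, not the authors' script: the data layout (a list of `(annotator, instance, rating)` triples with real annotator ids `0..n_real_annotators-1`), the function name, and the way fake ratings are distributed among fake annotators (distinct fake annotators sampled per instance) are all assumptions.

```python
import random


def add_spam_ratings(ratings, n_real_annotators, fakes_per_instance, K, seed=0):
    """Append uniformly random ratings from newly created fake annotators."""
    rng = random.Random(seed)
    out = list(ratings)
    instances = sorted({m for _, m, _ in ratings})
    n_fake_ratings = fakes_per_instance * len(instances)
    if n_fake_ratings == 0:
        return out
    # Match the average load of a real annotator (rounded to whole annotators),
    # while keeping enough fake annotators to pick distinct ones per instance.
    avg_load = len(ratings) / n_real_annotators
    n_fake = max(fakes_per_instance, round(n_fake_ratings / avg_load))
    for m in instances:
        for j in rng.sample(range(n_fake), fakes_per_instance):
            out.append((n_real_annotators + j, m, rng.randint(1, K)))
    return out


ratings = [(0, 0, 3), (1, 0, 4), (0, 1, 2), (1, 1, 2)]
print(add_spam_ratings(ratings, n_real_annotators=2, fakes_per_instance=3, K=5))
```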
Figure 2: Effect of spammy ratings on different methods (Yandex dataset): Number of
additional spammy ratings per query-URL pair versus MSE (top-left), Correlation (top-right) and
NDCG (bottom-left). Variants of the proposed Ordinal-discrete-mixture model are shown
in blue, existing state-of-the-art methods that account for annotator expertise (discussed in
Section 3) are shown in red and baselines that do not account for annotator expertise are
shown in black. Baselines (mean, median, majority vote) which do not account for annotator
expertise perform significantly worse as the number of spammy ratings increases. The
proposed Ordinal-discrete-mixture model outperforms other methods.






4.6 Effect of ordinal link and mixture model 



In this section, we test the effect of the ordinal link function and the mixture model on the
Ordinal-discrete-mixture model. We consider four variants of the Ordinal-discrete-mixture model,
obtained by choosing whether or not the ordinal mapping is used and whether or not the spammy
mixture component is used. Other details of the experimental setup are identical to Section 4.5. To avoid
clutter, we just present results for the case where instance difficulty is not modeled (other
granularities for instance difficulty lead to qualitatively similar trends). The results are shown
in Figure 3. We observe that (i) the variants with the spam mixture component are robust
to spammy ratings, and (ii) amongst the variants with the spam component, the ordinal
likelihood model outperforms the real-valued likelihood model. This experiment illustrates
that both the spam mixture component and the ordinal mapping in the Ordinal-discrete-mixture
model are necessary for good empirical performance in the presence of spammy
ratings.
Figure 3: Effect of spammy ratings on different variants of the Ordinal-discrete-mixture
model (Yandex dataset): Number of additional spammy ratings per query-URL pair versus
MSE (top-left), Correlation (top-right) and NDCG (bottom-left). See main text for
additional information.
5 Conclusion



We presented the Ordinal-discrete-mixture model for multi-annotator ordinal data and a 
scalable variational inference algorithm. The proposed model encompasses several previously 
proposed models. We reviewed the ordinal extensions of several state-of-the-art rating models
for binary/categorical/real-valued data and evaluated how well the models recover the ground
truth labels. Our experiments on two real world datasets containing query-URL relevance 
scores from AMT indicate that (i) the proposed model outperforms or performs as well as 
existing models in terms of MSE, correlation coefficient and NDCG and (ii) the proposed 
model is more resistant to spammy annotators than simple baselines which do not model 
annotator expertise. Some interesting future directions are (i) joint estimation of the ground 
truth and optimization of the (supervised) ranking model, and (ii) generalizing our model 
to account for instance features using a Gaussian process prior in the flavor of [RGP10].

A Description of related work



A.1 Dawid-Skene model

In this case, the ordinal labels are treated as $K$ distinct categories. A straightforward
approach is to model the $K \times K$ confusion matrix for each annotator [DS79, RYZ+10].

Let $\theta = \{\phi, \pi\}$, where $\pi$ is a $K$-dimensional prior such that $\pi_k = p(z_m = k)$ and $\phi$ is an
$N \times K \times K$ matrix such that $\phi_{nkj} = p(r_{nm} = j \mid z_m = k)$. Note that $\sum_{j'} \phi_{nkj'} = 1 \;\forall\, n, k$;
hence $\phi$ contains $NK(K-1)$ free parameters. The generative process can be described as
follows, 

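$$p(R \mid z, \phi) = \prod_{nm \in L} p(r_{nm} \mid z_m, \phi), \qquad (6)$$

$$p(r_{nm} \mid z_m, \phi) = \prod_{j,k} \phi_{nkj}^{1[r_{nm} = j]\, 1[z_m = k]}, \qquad (7)$$

$$p(z \mid \pi) = \prod_m p(z_m \mid \pi) = \prod_m \prod_k \pi_k^{1[z_m = k]}.$$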



Note that this model does not account for instance difficulty. Inference is based on the EM
algorithm, treating $z$ as the latent variables and $\theta$ as the parameters. The E-step and M-step
updates are given by



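$$\lambda_{mk} = p(z_m = k \mid R, \theta) \propto \pi_k \prod_{n:\, nm \in L} \phi_{n k r_{nm}},$$

$$\pi_k = \frac{1}{M} \sum_{m=1}^{M} \lambda_{mk}, \qquad (9)$$

$$\phi_{nkj} \propto \sum_{m:\, nm \in L} \lambda_{mk}\, 1[r_{nm} = j]. \qquad (10)$$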



The above equations need to be normalized such that $\sum_{k'} \lambda_{mk'} = 1$ and $\sum_{j'} \phi_{nkj'} = 1$. In our
experiments, we additionally imposed a symmetric Dirichlet prior on $\phi_{nk\cdot}$ with concentration
parameter $\alpha = 1$. As before, the predicted estimate is the posterior mean, which is given by

$$\hat{z}_m = E[z_m \mid R, \theta] = \sum_k \lambda_{mk}\, k. \qquad (11)$$
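
The EM updates above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: labels are stored in a dictionary keyed by (annotator, instance), the posterior is initialised from vote fractions, and the `smoothing` pseudo-count is a hypothetical stand-in that only keeps probabilities away from zero (it is not claimed to reproduce the Dirichlet prior mentioned above). Products are taken directly rather than in log space, which is adequate only for small examples.

```python
def normalize(weights):
    """Scale non-negative weights to sum to one (uniform if they are all zero)."""
    total = sum(weights)
    if total == 0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]

def dawid_skene(labels, N, M, K, iters=50, smoothing=0.01):
    """labels maps (n, m) -> observed label in 1..K; returns (lambda, posterior means)."""
    # Initialise lambda_mk with the fraction of annotators who gave label k to instance m.
    lam = [[0.0] * K for _ in range(M)]
    for (n, m), r in labels.items():
        lam[m][r - 1] += 1.0
    lam = [normalize(row) for row in lam]
    for _ in range(iters):
        # M-step, eq. (9): class prior.
        pi = [sum(lam[m][k] for m in range(M)) / M for k in range(K)]
        # M-step, eq. (10): per-annotator confusion matrices (rows normalised over j).
        phi = [[[smoothing] * K for _ in range(K)] for _ in range(N)]
        for (n, m), r in labels.items():
            for k in range(K):
                phi[n][k][r - 1] += lam[m][k]
        phi = [[normalize(row) for row in phi_n] for phi_n in phi]
        # E-step: posterior over the true label of each instance.
        lam = [list(pi) for _ in range(M)]
        for (n, m), r in labels.items():
            for k in range(K):
                lam[m][k] *= phi[n][k][r - 1]
        lam = [normalize(row) for row in lam]
    # Prediction, eq. (11): posterior mean of the ordinal label.
    z_hat = [sum(lam[m][k] * (k + 1) for k in range(K)) for m in range(M)]
    return lam, z_hat

# Toy data: annotators 0 and 1 agree, annotator 2 disagrees on every instance.
labels = {(0, 0): 1, (1, 0): 1, (2, 0): 3,
          (0, 1): 2, (1, 1): 2, (2, 1): 1,
          (0, 2): 3, (1, 2): 3, (2, 2): 2}
lam, z_hat = dawid_skene(labels, N=3, M=3, K=3)
print([round(z, 2) for z in z_hat])
```

Because the prediction is a posterior mean, it is real-valued and need not coincide with the most probable category.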



A.2 GLAD model

In this section, we describe the multi-class extension proposed in [WRW+09]. Note that
[WRW+09] did not experimentally evaluate their multi-class extension. Let $\alpha_n$ denote the
expertise of the $n$th annotator ($-\infty < \alpha_n < \infty$, where $\alpha_n < 0$ implies an adversarial annotator)
and $\beta_m > 0$ denote the inverse difficulty of the $m$th instance. The generative process is
similar to the Dawid-Skene model, except that (7) is replaced by

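$$p(r_{nm} \mid z_m = k, \alpha_n, \beta_m) = \sigma(\alpha_n \beta_m)^{1[r_{nm} = k]} \left( \frac{1 - \sigma(\alpha_n \beta_m)}{K - 1} \right)^{1[r_{nm} \neq k]}, \qquad (12)$$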

where $\sigma(\cdot)$ denotes the sigmoid function. Comparing the GLAD model with the Dawid-Skene
model, we see that GLAD models just $p(r_{nm} = z_m)$, unlike the confusion matrix in (7), which
has $K(K-1)$ free parameters per annotator. However, GLAD accounts for instance difficulty. An
alternative interpretation of the model is in terms of the log-odds ratio [WRW+09]: the logit of
the probability of a correct response is bilinear in $\alpha_n$ and $\beta_m$, i.e.,

$$\log \frac{p(r_{nm} = z_m \mid \alpha_n, \beta_m)}{1 - p(r_{nm} = z_m \mid \alpha_n, \beta_m)} = \alpha_n \beta_m. \qquad (13)$$
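
The likelihood in (12) and the log-odds identity in (13) can be checked numerically with a short sketch. It assumes, as the text states, that the probability mass left over after a correct response is shared equally among the $K-1$ incorrect labels; the function names and the parameter values are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def glad_likelihood(r, k, alpha_n, beta_m, K):
    """p(r_nm = r | z_m = k, alpha_n, beta_m) for the multi-class GLAD extension."""
    p_correct = sigmoid(alpha_n * beta_m)
    # Assumption: each of the K-1 incorrect labels receives an equal share.
    return p_correct if r == k else (1.0 - p_correct) / (K - 1)

K, alpha_n, beta_m = 5, 1.2, 0.7  # hypothetical values
p = glad_likelihood(2, 2, alpha_n, beta_m, K)
print(math.log(p / (1.0 - p)), alpha_n * beta_m)          # logit equals alpha_n * beta_m
print(sum(glad_likelihood(r, 2, alpha_n, beta_m, K) for r in range(1, K + 1)))
print(glad_likelihood(2, 2, alpha_n, 0.0, K))             # beta_m = 0 gives 0.5, not 1/K
```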

Note that higher $\alpha_n$ implies higher $p(r_{nm} = k)$ when $z_m = k$, and as $\beta_m \to 0$, i.e., instance difficulty
increases, $p(r_{nm} = z_m) \to 0.5$ instead of $\frac{1}{K}$ (corresponding to a random guess). Note that
the model also assumes that all incorrect labels are equally likely, for example
$p(r_{nm} = 2 \mid z_m = 1) = p(r_{nm} = K \mid z_m = 1)$, which is typically not a realistic assumption. The E-step
updates are given by:

$$\lambda_{mk} \propto \pi_k \prod_{n:\, nm \in L} p(r_{nm} \mid z_m = k, \alpha_n, \beta_m), \qquad (14)$$

and $\lambda$ is normalized as in the previous case. The M-step update for $\pi$ is the same as (9). Let
$Q(\alpha, \beta)$ denote the lower bound on the log likelihood. We follow a coordinate optimization
approach for optimizing $Q(\alpha, \beta)$ with respect to $\alpha$ and $\beta$. The gradients are given by



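$$\frac{\partial Q(\alpha, \beta)}{\partial \alpha_n} = \sum_k \sum_{m:\, nm \in L} \lambda_{mk}\, \beta_m \left( 1[r_{nm} = k] - \sigma(\alpha_n \beta_m) \right),$$

$$\frac{\partial Q(\alpha, \beta)}{\partial \beta_m} = \sum_k \sum_{n:\, nm \in L} \lambda_{mk}\, \alpha_n \left( 1[r_{nm} = k] - \sigma(\alpha_n \beta_m) \right).$$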



We use a conjugate gradient solver (footnote 6) for optimizing $Q(\alpha, \beta)$ with respect to $\alpha$ and $\log \beta$.
In our experiments, we additionally imposed priors $\alpha_n \sim \mathcal{N}(1, 1)$, $\log \beta_m \sim \mathcal{N}(1, 1)$ as
suggested by [WRW+09]. The prediction is the posterior mean and can be computed using (11).
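
A quick way to sanity-check the GLAD gradients is to compare them with finite differences of the expected log likelihood. The sketch below does this for a toy problem; it omits the Gaussian priors, assumes incorrect labels share the remaining probability equally, and uses hypothetical names (`expected_loglik`, `grad_alpha`, `grad_beta`) and made-up values throughout.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def expected_loglik(labels, lam, alpha, beta, K):
    """The part of Q(alpha, beta) that depends on alpha and beta (priors omitted)."""
    q = 0.0
    for (n, m), r in labels.items():
        s = sigmoid(alpha[n] * beta[m])
        for k in range(1, K + 1):
            p = s if r == k else (1.0 - s) / (K - 1)
            q += lam[m][k - 1] * math.log(p)
    return q

def grad_alpha(labels, lam, alpha, beta, K):
    """dQ/d alpha_n = sum_k sum_m lambda_mk * beta_m * (1[r_nm = k] - sigma(alpha_n beta_m))."""
    g = [0.0] * len(alpha)
    for (n, m), r in labels.items():
        s = sigmoid(alpha[n] * beta[m])
        for k in range(1, K + 1):
            g[n] += lam[m][k - 1] * beta[m] * ((1.0 if r == k else 0.0) - s)
    return g

def grad_beta(labels, lam, alpha, beta, K):
    """dQ/d beta_m; multiply by beta_m (chain rule) when optimizing over log beta_m."""
    g = [0.0] * len(beta)
    for (n, m), r in labels.items():
        s = sigmoid(alpha[n] * beta[m])
        for k in range(1, K + 1):
            g[m] += lam[m][k - 1] * alpha[n] * ((1.0 if r == k else 0.0) - s)
    return g

labels = {(0, 0): 1, (0, 1): 3, (1, 0): 2, (1, 1): 3}  # toy (annotator, instance) -> label
lam = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]               # toy posteriors lambda_mk
alpha, beta = [1.0, 0.5], [0.8, 1.5]
print(grad_alpha(labels, lam, alpha, beta, 3))
print(grad_beta(labels, lam, alpha, beta, 3))
```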



A.3 Ord-Binary model

A simple approach to reduce ordinal labels to binary labels is to define $K-1$ binary variables
as follows [FH01]:

$$\tilde{z}_{mk} = 1[z_m > k], \quad 1 \le k < K, \qquad (15)$$



6 We used Carl E. Rasmussen's minimize.m in our experiments. The script is available at
http://learning.eng.cam.ac.uk/carl/code/minimize/
where the tilde indicates the $K-1$ binary variables corresponding to an ordinal variable.
Similarly, let $\tilde{r}_{nmk} = 1[r_{nm} > k]$. Let $\tilde{z}_{m\cdot}$ and $\tilde{r}_{nm\cdot}$ denote the $(K-1)$-dimensional binary
representations of $z_m$ and $r_{nm}$ respectively. Raykar et al. [RYZ+09] suggested that it is
possible to extend their two-coin binary noise model to ordinal labels using (15), but did
not specify the exact noise model or explore the ordinal extension in detail. Extending the
model in [RY11] to the case where the ground truth is ordinal, we consider the following
model,



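$$p(\tilde{r}_{nm\cdot} \mid \tilde{z}_{m\cdot}, \theta) = \prod_{k=1}^{K-1} p(\tilde{r}_{nmk} \mid \tilde{z}_{mk}, \theta),$$

$$p(\tilde{r}_{nmk} \mid \tilde{z}_{mk}, \theta) = \left( (1 - \beta_{nk})^{\tilde{r}_{nmk}} \beta_{nk}^{1 - \tilde{r}_{nmk}} \right)^{1 - \tilde{z}_{mk}} \left( \alpha_{nk}^{\tilde{r}_{nmk}} (1 - \alpha_{nk})^{1 - \tilde{r}_{nmk}} \right)^{\tilde{z}_{mk}}, \qquad (16)$$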

where $\theta = \{\alpha_{nk}, \beta_{nk}\}_{nk}$, and $\alpha_{nk} = p(\tilde{r}_{nmk} = 1 \mid \tilde{z}_{mk} = 1)$ and $\beta_{nk} = p(\tilde{r}_{nmk} = 0 \mid \tilde{z}_{mk} = 0)$
denote the sensitivity and specificity of the $n$th annotator for the $k$th binary variable. Hence,
we use $2(K-1)$ parameters per annotator. For $K = 3$, the confusion matrix ($\tilde{z}_{m\cdot}$ vs $\tilde{r}_{nm\cdot}$)
is shown in Table 2. Note that the model assumes that $\tilde{r}_{nm\cdot}$ can take on $2^{K-1}$ possible
values, leading to non-zero likelihood values for some invalid combinations of $\tilde{r}_{nm\cdot}$. However,
we restrict the posterior $p(\tilde{z}_{m\cdot} \mid R)$ to $K$ values by assigning zero probability to the invalid
combinations in the prior $p(\tilde{z}_{m\cdot})$. Note that this model does not account for instance difficulty.

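Table 2: Ord-Binary model: Confusion matrix ($K \times 2^{K-1}$) for the $n$th annotator for $K = 3$.
Rows indicate true labels and columns indicate observed labels. The true labels are assumed
to lie on an ordinal scale. The values in parentheses indicate the $(K-1)$ binary variables
defined in (15).

True vs Observed | 1 (00)                              | 2 (10)                       | 3 (11)                             | Invalid (01)
-----------------+-------------------------------------+------------------------------+------------------------------------+------------------------------------
1 (00)           | $\beta_{n1}\beta_{n2}$              | $(1-\beta_{n1})\beta_{n2}$   | $(1-\beta_{n1})(1-\beta_{n2})$     | $\beta_{n1}(1-\beta_{n2})$
2 (10)           | $(1-\alpha_{n1})\beta_{n2}$         | $\alpha_{n1}\beta_{n2}$      | $\alpha_{n1}(1-\beta_{n2})$        | $(1-\alpha_{n1})(1-\beta_{n2})$
3 (11)           | $(1-\alpha_{n1})(1-\alpha_{n2})$    | $\alpha_{n1}(1-\alpha_{n2})$ | $\alpha_{n1}\alpha_{n2}$           | $(1-\alpha_{n1})\alpha_{n2}$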
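
The binary encoding in (15) and the per-annotator likelihood in (16) can be tabulated programmatically, which is a convenient way to check entries of the confusion matrix for $K = 3$. The sketch below is illustrative only: the sensitivity and specificity values are made up, and the function names `encode` and `ord_binary_likelihood` are hypothetical.

```python
from itertools import product

def encode(label, K):
    """Binary representation from (15): entry k is 1[label > k] for k = 1..K-1."""
    return tuple(int(label > k) for k in range(1, K))

def ord_binary_likelihood(r_bits, z_bits, alpha_n, beta_n):
    """p(r~ | z~) from (16) for one annotator; alpha_n[k], beta_n[k] are the
    sensitivity and specificity for the k-th binary variable."""
    p = 1.0
    for r, z, a, b in zip(r_bits, z_bits, alpha_n, beta_n):
        if z == 1:
            p *= a if r == 1 else 1.0 - a
        else:
            p *= b if r == 0 else 1.0 - b
    return p

K = 3
alpha_n, beta_n = [0.9, 0.8], [0.85, 0.95]  # hypothetical annotator parameters
for z in range(1, K + 1):
    # Enumerate all 2^(K-1) observed bit patterns, including the invalid one (0, 1).
    row = {bits: round(ord_binary_likelihood(bits, encode(z, K), alpha_n, beta_n), 4)
           for bits in product((0, 1), repeat=K - 1)}
    print(z, encode(z, K), row)
```

Each row sums to one over all four observed patterns, so some probability mass necessarily falls on the invalid pattern (01).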



The EM updates can be derived as follows: 

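$$\lambda_{mk} \propto \pi_k \prod_{n:\, nm \in L} p(r_{nm} \mid z_m = k, \theta), \quad 1 \le k \le K,$$

$$\gamma_{mk'} = \sum_k \lambda_{mk}\, 1[k > k'], \quad 1 \le k' < K,$$

$$\alpha_{nk'} = \frac{\sum_{m:\, nm \in L} \gamma_{mk'}\, \tilde{r}_{nmk'}}{\sum_{m:\, nm \in L} \gamma_{mk'}},$$

$$\beta_{nk'} = \frac{\sum_{m:\, nm \in L} (1 - \gamma_{mk'})(1 - \tilde{r}_{nmk'})}{\sum_{m:\, nm \in L} (1 - \gamma_{mk'})}. \qquad (17)$$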



The prediction is the posterior mean and can be computed using (11). 
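
The M-step of the Ord-Binary model can be sketched as follows, reading the update in (17) as posterior-weighted sensitivity and specificity estimates for each binary variable. This is a minimal illustration with hypothetical names (`gamma_from_lambda`, `m_step`); the fallback value of 0.5 for an empty denominator is an assumption of this example and is not taken from the text.

```python
def gamma_from_lambda(lam_m, K):
    """gamma_mk' = sum_k lambda_mk * 1[k > k'], i.e. the posterior probability
    that the k'-th binary variable of instance m equals one (k' = 1..K-1)."""
    return [sum(lam_m[k - 1] for k in range(kp + 1, K + 1)) for kp in range(1, K)]

def m_step(labels, lam, N, K):
    """labels maps (n, m) -> label in 1..K; lam[m][k-1] is lambda_mk."""
    num_a = [[0.0] * (K - 1) for _ in range(N)]
    den_a = [[0.0] * (K - 1) for _ in range(N)]
    num_b = [[0.0] * (K - 1) for _ in range(N)]
    den_b = [[0.0] * (K - 1) for _ in range(N)]
    for (n, m), r in labels.items():
        gamma = gamma_from_lambda(lam[m], K)
        for kp in range(1, K):
            g = gamma[kp - 1]
            r_bit = 1.0 if r > kp else 0.0      # observed binary variable
            num_a[n][kp - 1] += g * r_bit
            den_a[n][kp - 1] += g
            num_b[n][kp - 1] += (1.0 - g) * (1.0 - r_bit)
            den_b[n][kp - 1] += 1.0 - g

    def ratio(num, den):
        # Assumption: fall back to 0.5 when an annotator has no relevant mass.
        return num / den if den > 0 else 0.5

    alpha = [[ratio(num_a[n][j], den_a[n][j]) for j in range(K - 1)] for n in range(N)]
    beta = [[ratio(num_b[n][j], den_b[n][j]) for j in range(K - 1)] for n in range(N)]
    return alpha, beta

# Toy check: a single annotator who always reports the (known) true label.
lam = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
labels = {(0, 0): 1, (0, 1): 2, (0, 2): 3}
print(m_step(labels, lam, N=1, K=3))
```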